This report presents a detailed analysis forecasting expected sales of High-End Vacuums for the Vac-Attack company in December 2020, using historical data. The aim is to determine whether the company will meet its sales target and to ensure enough stock is on hand to match demand.
The report covers data cleaning, summary statistics, visualisation, and statistical model fitting, implemented in the R programming language.
if (!require(tidyverse))
install.packages("tidyverse")
if (!require(lubridate))
install.packages("lubridate")
if (!require(plotly))
install.packages("plotly")
if (!require(stargazer))
install.packages("stargazer")
if (!require(car))
install.packages("car")
if (!require(xgboost))
install.packages("xgboost")
library(tidyverse)
library(lubridate)
library(stargazer)
library(car)
library(xgboost)

First import the Market data.
market_sale_col <- readr::read_csv("Data/MarketingCols.csv", col_names = FALSE)
market_sale_raw <- readr::read_csv("Data/MarketingSales.csv",
col_names = market_sale_col$X1)

Then import the December data for prediction.
dec_col <- readr::read_csv("Data/DecemberCols.csv", col_names = FALSE)
dec_ad_raw <- readr::read_csv("Data/DecemberAdData.csv",
col_names = dec_col$X1)

A quick exploration of the imported data sets:
dim(market_sale_raw)

## [1] 1796 11

glimpse(market_sale_raw)

## Rows: 1,796
## Columns: 11
## $ Date <chr> "1/01/16", "2/01/16", "3/01/16", "4/01/16", "5/…
## $ PositiveNews <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ NegativeCoverage <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Competition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AdvertisingSpend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, …
## $ Month <chr> "January", "January", "January", "January", "Ja…
## $ Day <chr> "Friday", "Saturday", "Sunday", "Monday", "Tues…
## $ `0508Line_247` <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ UltraEdition_Available <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COVID_Lockdown <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83,…
dim(dec_ad_raw)

## [1] 31 4

glimpse(dec_ad_raw)

## Rows: 31
## Columns: 4
## $ Date <chr> "1/12/20", "2/12/20", "3/12/20", "4/12/20", "5/12/20"…
## $ AdvertisingSpend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17…
## $ Month <chr> "December", "December", "December", "December", "Dece…
## $ Day <chr> "Tuesday", "Wednesday", "Thursday", "Friday", "Saturd…
The next step is to tidy up the data sets for visualisation and
modelling. I have also included year, month and day variables in both
factor and numeric formats.
market_sale_final <-
market_sale_raw %>%
mutate(Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(`0508Line_247`),
Positive_News = factor(PositiveNews),
Negative_News = factor(NegativeCoverage),
Competition = factor(Competition),
Ultra_Edition = factor( UltraEdition_Available),
COVID_Lockdown = factor(COVID_Lockdown),
Year = factor(year(Date)),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Month_num = as.numeric(Month),
Year_num = year(Date),
Day_num = as.numeric(Day),
Date_num = date(Date)) %>%
select(Sales, Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition, COVID_Lockdown,
Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(market_sale_final)

## [1] 1796 16

glimpse(market_sale_final)

## Rows: 1,796
## Columns: 16
## $ Sales <dbl> 66, 84, 78, 70, 73, 67, 85, 78, 80, 87, 68, 83, 62, 94,…
## $ Ad_Spend <dbl> 4199.86, 14768.20, 11019.79, 3358.30, 5207.19, 3962.76,…
## $ Date <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Phone_24 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
## $ Month <fct> January, January, January, January, January, January, J…
## $ Day <fct> Friday, Saturday, Sunday, Monday, Tuesday, Wednesday, T…
## $ Day_num <dbl> 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2…
## $ Date_num <date> 2016-01-01, 2016-01-02, 2016-01-03, 2016-01-04, 2016-0…
## $ Month_num <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Year_num <dbl> 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2016, 2…
dec_ad_final <-
dec_ad_raw %>%
mutate(
Date = dmy(Date),
Ad_Spend = AdvertisingSpend,
Phone_24 = factor(0, levels = c(0, 1)),
Positive_News = factor(0, levels = c(0, 1)),
Negative_News = factor(0, levels = c(0, 1)),
Competition = factor(0, levels = c(0, 1)),
Ultra_Edition = factor(1, levels = c(0, 1)),
COVID_Lockdown = factor(0, levels = c(0, 1)),
Year = factor(2020, levels = market_sale_final$Year |> levels() ),
Month = factor(Month, levels = month.name),
Day = factor(Day, levels =
c("Monday", "Tuesday", "Wednesday", "Thursday",
"Friday", "Saturday", "Sunday")),
Year_num = 2020,
Month_num = 12,
Day_num = as.numeric(Day),
Date_num = date(Date)
) %>%
select(Ad_Spend, Date, Phone_24, Positive_News, Negative_News,
Competition, Ultra_Edition,
COVID_Lockdown, Year, Month, Day, Day_num, Date_num, Month_num, Year_num)
dim(dec_ad_final)

## [1] 31 15

glimpse(dec_ad_final)

## Rows: 31
## Columns: 15
## $ Ad_Spend <dbl> 10568.28, 8218.31, 4907.38, 8057.25, 21481.50, 484.17, …
## $ Date <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Phone_24 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Positive_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Negative_News <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Competition <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Ultra_Edition <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ COVID_Lockdown <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ Year <fct> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
## $ Month <fct> December, December, December, December, December, Decem…
## $ Day <fct> Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday,…
## $ Day_num <dbl> 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6…
## $ Date_num <date> 2020-12-01, 2020-12-02, 2020-12-03, 2020-12-04, 2020-1…
## $ Month_num <dbl> 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12,…
## $ Year_num <dbl> 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2…
I will first generate some plots to look for relationships.
The first plot shows total units sold over time.
ggplot(market_sale_final, aes(x = Date, y = Sales)) +
geom_path() +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()

There is clearly some seasonality in total units sold over time, which the model will need to account for.
The second plot examines the relationship between total units sold and total advertising spend.
ggplot(market_sale_final, aes(x = Ad_Spend, y = Sales)) +
geom_point()

There is a weak positive correlation between advertising spend and total units sold, with a Pearson correlation of 41.1%.
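The quoted Pearson correlation can be checked directly with `cor()` (a quick sketch on the cleaned data):

```r
# Pearson correlation between daily advertising spend and units sold
cor(market_sale_final$Ad_Spend, market_sale_final$Sales,
    method = "pearson")
```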
ggplot(market_sale_final, aes(x = Date)) +
geom_line(aes(y = Sales), linewidth = 1.5, color = "red", alpha = 0.8) +
geom_line(aes(y = Ad_Spend/100), color = "blue", alpha = 0.8) +
scale_y_continuous(
name = "The units sold",
# Add a second axis and specify its features
sec.axis = sec_axis(~.*100, name="Total Advertising Spend ($)")
) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light() +
theme(
axis.title.y = element_text(color = "red", size=13),
axis.title.y.right = element_text(color = "blue", size=13)
)

hist(market_sale_final$Sales)

The distribution of Sales looks reasonably close to normal, so there is no need to apply any additional normalisation.
Total units sold versus years.
market_sale_final %>%
group_by(Year) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()

| Year | Sales_mean | Sales_sum |
|---|---|---|
| 2016 | 89.48087 | 32750 |
| 2017 | 82.02740 | 29940 |
| 2018 | 76.81370 | 28037 |
| 2019 | 78.07671 | 28498 |
| 2020 | 66.04179 | 22124 |
The total units sold appears to decline from year to year.
Total units sold versus months.
market_sale_final %>%
group_by(Month) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Month | Sales_mean | Sales_sum |
|---|---|---|
| January | 66.17419 | 10257 |
| February | 65.78169 | 9341 |
| March | 64.52258 | 10001 |
| April | 59.66000 | 8949 |
| May | 65.76129 | 10193 |
| June | 71.71333 | 10757 |
| July | 71.52258 | 11086 |
| August | 80.06452 | 12410 |
| September | 88.28000 | 13242 |
| October | 91.63226 | 14203 |
| November | 107.91333 | 16187 |
| December | 118.73387 | 14723 |
Most of the units are sold in the later months of the year.
Total units sold versus days of the week.
market_sale_final %>%
group_by(Day) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Day | Sales_mean | Sales_sum |
|---|---|---|
| Monday | 78.26848 | 20115 |
| Tuesday | 78.85156 | 20186 |
| Wednesday | 77.59766 | 19865 |
| Thursday | 78.82422 | 20179 |
| Friday | 79.33852 | 20390 |
| Saturday | 78.82101 | 20257 |
| Sunday | 79.21012 | 20357 |
Total units sold is very similar across the days of the week.
market_sale_final %>%
group_by(Ultra_Edition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Ultra_Edition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 81.28650 | 74052 |
| 1 | 76.04181 | 67297 |
market_sale_final %>%
group_by(Year, Ultra_Edition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Year | Ultra_Edition | Sales_mean | Sales_sum |
|---|---|---|---|
| 2016 | 0 | 89.48087 | 32750 |
| 2017 | 0 | 82.02740 | 29940 |
| 2018 | 0 | 63.12222 | 11362 |
| 2018 | 1 | 90.13514 | 16675 |
| 2019 | 1 | 78.07671 | 28498 |
| 2020 | 1 | 66.04179 | 22124 |
market_sale_final %>%
group_by(Positive_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Positive_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.22094 | 135948 |
| 1 | 93.12069 | 5401 |
market_sale_final %>%
group_by(Negative_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales)) %>%
knitr::kable()

| Negative_News | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 78.85618 | 140364 |
| 1 | 61.56250 | 985 |
market_sale_final %>%
group_by(Positive_News, Negative_News) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

## `summarise()` has grouped output by 'Positive_News'. You can override using the
## `.groups` argument.
| Positive_News | Negative_News | Sales_mean | Sales_sum |
|---|---|---|---|
| 0 | 0 | 78.37573 | 134963 |
| 0 | 1 | 61.56250 | 985 |
| 1 | 0 | 93.12069 | 5401 |
market_sale_final %>%
group_by(Competition) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Competition | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.09143 | 111247 |
| 1 | 73.96069 | 30102 |
market_sale_final %>%
group_by(COVID_Lockdown) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| COVID_Lockdown | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 80.24484 | 139947 |
| 1 | 26.96154 | 1402 |
market_sale_final %>%
group_by(Phone_24) %>%
summarise(Sales_mean = mean(Sales),
Sales_sum = sum(Sales))%>%
knitr::kable()

| Phone_24 | Sales_mean | Sales_sum |
|---|---|---|
| 0 | 76.10649 | 95057 |
| 1 | 84.62888 | 46292 |
linear_reg_fit <-
lm(
Sales ~ Ad_Spend + (Year * Month) / Day +
COVID_Lockdown + Ultra_Edition + Phone_24 +
Positive_News + Negative_News + Competition,
data = market_sale_final
)

anova(linear_reg_fit) %>% knitr::kable()

|   | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Ad_Spend | 1 | 151426.5271 | 151426.52709 | 5321.5665377 | 0.0000000 |
| Year | 4 | 94607.6547 | 23651.91367 | 831.1967181 | 0.0000000 |
| Month | 11 | 517816.9361 | 47074.26692 | 1654.3260182 | 0.0000000 |
| COVID_Lockdown | 1 | 45137.4566 | 45137.45660 | 1586.2608966 | 0.0000000 |
| Ultra_Edition | 1 | 300.6070 | 300.60700 | 10.5642002 | 0.0011809 |
| Phone_24 | 1 | 10714.6209 | 10714.62092 | 376.5427976 | 0.0000000 |
| Positive_News | 1 | 9671.6523 | 9671.65230 | 339.8898608 | 0.0000000 |
| Negative_News | 1 | 7108.2051 | 7108.20513 | 249.8029063 | 0.0000000 |
| Competition | 1 | 108.5536 | 108.55356 | 3.8148865 | 0.0510013 |
| Year:Month | 42 | 13191.9256 | 314.09347 | 11.0381537 | 0.0000000 |
| Year:Month:Day | 354 | 8296.6084 | 23.43675 | 0.8236351 | 0.9874761 |
| Residuals | 1377 | 39182.8846 | 28.45525 | NA | NA |
linear_reg_fit_final <- step(linear_reg_fit)

## Start: AIC=6374.49
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Competition
##
##
## Step: AIC=6374.49
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month +
## Year:Month:Day
##
## Df Sum of Sq RSS AIC
## - Year:Month:Day 354 8297 47479 6011.4
## - Ultra_Edition 1 21 39204 6373.5
## - Competition 1 39 39222 6374.3
## <none> 39183 6374.5
## - Negative_News 1 6008 45191 6628.7
## - Positive_News 1 7675 46858 6693.7
## - COVID_Lockdown 1 15661 54844 6976.4
## - Ad_Spend 1 114379 153562 8825.6
##
## Step: AIC=6011.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month
##
## Df Sum of Sq RSS AIC
## - Ultra_Edition 1 23 47502 6010.3
## - Competition 1 30 47510 6010.6
## <none> 47479 6011.4
## - Negative_News 1 8015 55494 6289.6
## - Positive_News 1 9093 56572 6324.1
## - COVID_Lockdown 1 15965 63444 6530.0
## - Year:Month 43 22732 70212 6628.0
## - Ad_Spend 1 135383 182862 8431.2
##
## Step: AIC=6010.29
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Competition + Year:Month
##
## Df Sum of Sq RSS AIC
## - Competition 1 30 47532 6009.4
## <none> 47502 6010.3
## - Negative_News 1 8015 55517 6288.3
## - Positive_News 1 9085 56588 6322.6
## - COVID_Lockdown 1 15965 63467 6528.7
## - Year:Month 43 23191 70694 6638.3
## - Ad_Spend 1 135379 182881 8429.4
##
## Step: AIC=6009.43
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Year:Month
##
## Df Sum of Sq RSS AIC
## <none> 47532 6009.4
## - Negative_News 1 8015 55547 6287.3
## - Positive_News 1 9081 56614 6321.4
## - COVID_Lockdown 1 15964 63497 6527.5
## - Year:Month 43 23830 71363 6653.3
## - Ad_Spend 1 135661 183193 8430.5
# Choose a model by AIC in a Stepwise Algorithm
linear_reg_fit_final$anova

## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 NA NA 1377 39182.88 6374.490
## 2 - Phone_24 0 4.365575e-11 1377 39182.88 6374.490
## 3 - Year:Month:Day 354 8.296608e+03 1731 47479.49 6011.426
## 4 - Ultra_Edition 1 2.278925e+01 1732 47502.28 6010.288
## 5 - Competition 1 3.021734e+01 1733 47532.50 6009.430
anova(linear_reg_fit_final)

## Analysis of Variance Table
##
## Response: Sales
## Df Sum Sq Mean Sq F value Pr(>F)
## Ad_Spend 1 151427 151427 5520.900 < 2.2e-16 ***
## Year 4 94608 23652 862.331 < 2.2e-16 ***
## Month 11 517817 47074 1716.293 < 2.2e-16 ***
## COVID_Lockdown 1 45137 45137 1645.678 < 2.2e-16 ***
## Positive_News 1 9692 9692 353.375 < 2.2e-16 ***
## Negative_News 1 7520 7520 274.174 < 2.2e-16 ***
## Year:Month 43 23830 554 20.205 < 2.2e-16 ***
## Residuals 1733 47532 27
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(linear_reg_fit_final)

##
## Call:
## lm(formula = Sales ~ Ad_Spend + Year + Month + COVID_Lockdown +
## Positive_News + Negative_News + Year:Month, data = market_sale_final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.6621 -3.6102 -0.2807 3.3290 27.3765
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.565e+01 9.609e-01 68.324 < 2e-16 ***
## Ad_Spend 1.416e-03 2.014e-05 70.328 < 2e-16 ***
## Year2017 1.754e+00 1.331e+00 1.318 0.187765
## Year2018 -1.706e+01 1.331e+00 -12.817 < 2e-16 ***
## Year2019 -2.031e+01 1.331e+00 -15.255 < 2e-16 ***
## Year2020 -2.133e+01 1.331e+00 -16.025 < 2e-16 ***
## MonthFebruary -1.076e+00 1.355e+00 -0.794 0.427270
## MonthMarch 2.185e-01 1.331e+00 0.164 0.869649
## MonthApril -1.851e+00 1.342e+00 -1.379 0.168043
## MonthMay -2.427e+00 1.331e+00 -1.824 0.068390 .
## MonthJune 1.771e+00 1.342e+00 1.320 0.186943
## MonthJuly 9.950e-01 1.330e+00 0.748 0.454593
## MonthAugust 9.419e+00 1.332e+00 7.069 2.26e-12 ***
## MonthSeptember 1.897e+01 1.341e+00 14.143 < 2e-16 ***
## MonthOctober 2.745e+01 1.332e+00 20.605 < 2e-16 ***
## MonthNovember 4.178e+01 1.342e+00 31.126 < 2e-16 ***
## MonthDecember 4.905e+01 1.331e+00 36.860 < 2e-16 ***
## COVID_Lockdown1 -3.483e+01 1.444e+00 -24.125 < 2e-16 ***
## Positive_News1 1.290e+01 7.090e-01 18.196 < 2e-16 ***
## Negative_News1 -2.277e+01 1.332e+00 -17.094 < 2e-16 ***
## Year2017:MonthFebruary -1.025e+01 1.926e+00 -5.323 1.16e-07 ***
## Year2018:MonthFebruary 7.149e+00 1.924e+00 3.715 0.000210 ***
## Year2019:MonthFebruary 3.171e+00 1.924e+00 1.649 0.099408 .
## Year2020:MonthFebruary 5.005e+00 1.914e+00 2.615 0.008990 **
## Year2017:MonthMarch -3.546e+00 1.883e+00 -1.883 0.059855 .
## Year2018:MonthMarch 7.306e-01 1.883e+00 0.388 0.698001
## Year2019:MonthMarch 2.308e+00 1.882e+00 1.226 0.220208
## Year2020:MonthMarch 5.374e+00 1.910e+00 2.813 0.004961 **
## Year2017:MonthApril -6.831e+00 1.898e+00 -3.599 0.000329 ***
## Year2018:MonthApril 4.347e+00 1.898e+00 2.291 0.022106 *
## Year2019:MonthApril 8.629e+00 1.898e+00 4.546 5.86e-06 ***
## Year2020:MonthApril 8.313e+00 2.385e+00 3.486 0.000502 ***
## Year2017:MonthMay 1.501e+00 1.882e+00 0.797 0.425282
## Year2018:MonthMay 6.937e+00 1.882e+00 3.686 0.000235 ***
## Year2019:MonthMay 1.105e+01 1.882e+00 5.870 5.22e-09 ***
## Year2020:MonthMay 1.115e+01 2.007e+00 5.555 3.21e-08 ***
## Year2017:MonthJune -4.719e+00 1.898e+00 -2.487 0.012975 *
## Year2018:MonthJune 3.308e+00 1.897e+00 1.743 0.081469 .
## Year2019:MonthJune 1.280e+01 1.900e+00 6.740 2.14e-11 ***
## Year2020:MonthJune 8.612e+00 1.898e+00 4.538 6.06e-06 ***
## Year2017:MonthJuly -1.327e+01 1.882e+00 -7.050 2.57e-12 ***
## Year2018:MonthJuly 8.134e+00 1.881e+00 4.324 1.62e-05 ***
## Year2019:MonthJuly 1.871e+01 1.881e+00 9.947 < 2e-16 ***
## Year2020:MonthJuly 8.028e+00 1.882e+00 4.265 2.11e-05 ***
## Year2017:MonthAugust -9.244e+00 1.885e+00 -4.904 1.02e-06 ***
## Year2018:MonthAugust 5.231e+00 1.884e+00 2.776 0.005563 **
## Year2019:MonthAugust 1.030e+01 1.883e+00 5.472 5.09e-08 ***
## Year2020:MonthAugust 1.518e+01 1.885e+00 8.052 1.50e-15 ***
## Year2017:MonthSeptember -1.093e+01 1.897e+00 -5.760 9.95e-09 ***
## Year2018:MonthSeptember 1.773e+00 1.897e+00 0.935 0.350071
## Year2019:MonthSeptember 1.298e+01 1.898e+00 6.839 1.10e-11 ***
## Year2020:MonthSeptember 1.043e+01 1.897e+00 5.496 4.46e-08 ***
## Year2017:MonthOctober -1.429e+01 1.884e+00 -7.584 5.43e-14 ***
## Year2018:MonthOctober 2.755e+00 1.883e+00 1.463 0.143639
## Year2019:MonthOctober 5.206e+00 1.885e+00 2.762 0.005803 **
## Year2020:MonthOctober 8.350e+00 1.882e+00 4.437 9.69e-06 ***
## Year2017:MonthNovember -1.919e+01 1.897e+00 -10.118 < 2e-16 ***
## Year2018:MonthNovember 7.326e+00 1.898e+00 3.860 0.000117 ***
## Year2019:MonthNovember 1.230e+01 1.898e+00 6.482 1.18e-10 ***
## Year2020:MonthNovember 4.986e+00 1.899e+00 2.626 0.008728 **
## Year2017:MonthDecember -1.355e+01 1.882e+00 -7.202 8.83e-13 ***
## Year2018:MonthDecember 6.887e+00 1.884e+00 3.656 0.000264 ***
## Year2019:MonthDecember 9.716e+00 1.882e+00 5.161 2.73e-07 ***
## Year2020:MonthDecember NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.237 on 1733 degrees of freedom
## Multiple R-squared: 0.947, Adjusted R-squared: 0.9451
## F-statistic: 499.9 on 62 and 1733 DF, p-value: < 2.2e-16
hist(linear_reg_fit_final$residuals)

plot(fitted(linear_reg_fit_final), resid(linear_reg_fit_final))

# calculate the mean squared error
mse_linear <-
mean((market_sale_final$Sales -
predict(linear_reg_fit_final,
newdata = market_sale_final,
type = "response"))^2)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
poisson_reg_fit <-
glm(
Sales ~ Ad_Spend + (Year * Month) / Day +
COVID_Lockdown + Ultra_Edition + Phone_24 +
Positive_News + Negative_News + Competition,
data = market_sale_final,
family = poisson()
)
car::Anova(poisson_reg_fit)

## Analysis of Deviance Table (Type II tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1364.9 1 < 2.2e-16 ***
## Year 15.4 4 0.003935 **
## Month 3501.8 11 < 2.2e-16 ***
## COVID_Lockdown 356.0 1 < 2.2e-16 ***
## Ultra_Edition 0.4 1 0.506484
## Phone_24 0
## Positive_News 88.1 1 < 2.2e-16 ***
## Negative_News 80.1 1 < 2.2e-16 ***
## Competition 0.3 1 0.594721
## Year:Month 216.2 42 < 2.2e-16 ***
## Year:Month:Day 131.5 354 1.000000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
poisson_reg_fit_final <- step(poisson_reg_fit)

## Start: AIC=12543.89
## Sales ~ Ad_Spend + (Year * Month)/Day + COVID_Lockdown + Ultra_Edition +
## Phone_24 + Positive_News + Negative_News + Competition
##
##
## Step: AIC=12543.89
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month +
## Year:Month:Day
##
## Df Deviance AIC
## - Year:Month:Day 354 772.78 11967
## - Competition 1 641.53 12542
## - Ultra_Edition 1 641.69 12542
## <none> 641.25 12544
## - Negative_News 1 721.35 12622
## - Positive_News 1 729.31 12630
## - COVID_Lockdown 1 997.21 12898
## - Ad_Spend 1 2006.15 13907
##
## Step: AIC=11967.41
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Competition + Year:Month
##
## Df Deviance AIC
## - Competition 1 772.98 11966
## - Ultra_Edition 1 773.15 11966
## <none> 772.78 11967
## - Negative_News 1 875.60 12068
## - Positive_News 1 877.81 12070
## - Year:Month 43 1106.23 12215
## - COVID_Lockdown 1 1130.86 12324
## - Ad_Spend 1 2394.15 13587
##
## Step: AIC=11965.61
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Ultra_Edition +
## Positive_News + Negative_News + Year:Month
##
## Df Deviance AIC
## - Ultra_Edition 1 773.35 11964
## <none> 772.98 11966
## - Negative_News 1 875.80 12066
## - Positive_News 1 877.97 12069
## - Year:Month 43 1122.66 12229
## - COVID_Lockdown 1 1131.06 12322
## - Ad_Spend 1 2397.68 13588
##
## Step: AIC=11963.99
## Sales ~ Ad_Spend + Year + Month + COVID_Lockdown + Positive_News +
## Negative_News + Year:Month
##
## Df Deviance AIC
## <none> 773.35 11964
## - Negative_News 1 876.17 12065
## - Positive_News 1 878.24 12067
## - Year:Month 43 1135.63 12240
## - COVID_Lockdown 1 1131.43 12320
## - Ad_Spend 1 2397.95 13587
car::Anova(poisson_reg_fit_final)

## Analysis of Deviance Table (Type II tests)
##
## Response: Sales
## LR Chisq Df Pr(>Chisq)
## Ad_Spend 1624.6 1 < 2.2e-16 ***
## Year 537.0 4 < 2.2e-16 ***
## Month 5536.1 11 < 2.2e-16 ***
## COVID_Lockdown 358.1 1 < 2.2e-16 ***
## Positive_News 104.9 1 < 2.2e-16 ***
## Negative_News 102.8 1 < 2.2e-16 ***
## Year:Month 362.3 43 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
hist(poisson_reg_fit$residuals)

plot(fitted(poisson_reg_fit), resid(poisson_reg_fit))

AIC(poisson_reg_fit_final)

## [1] 11963.99

# calculate the mean squared error
mse_poisson <-
mean((market_sale_final$Sales -
predict(poisson_reg_fit_final,
newdata = market_sale_final,
type = "response"))^2)

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type = if (type == :
## prediction from a rank-deficient fit may be misleading
linear_reg_predict <-
predict.lm(linear_reg_fit_final,
newdata = dec_ad_final,
type = "response",
se.fit = TRUE)

## Warning in predict.lm(linear_reg_fit_final, newdata = dec_ad_final, type =
## "response", : prediction from a rank-deficient fit may be misleading
dec_ad_final$Sales <- round(linear_reg_predict$fit)
dec_ad_final %>%
select(Date, Sales) %>%
knitr::kable()

| Date | Sales |
|---|---|
| 2020-12-01 | 108 |
| 2020-12-02 | 105 |
| 2020-12-03 | 100 |
| 2020-12-04 | 105 |
| 2020-12-05 | 124 |
| 2020-12-06 | 94 |
| 2020-12-07 | 111 |
| 2020-12-08 | 106 |
| 2020-12-09 | 116 |
| 2020-12-10 | 95 |
| 2020-12-11 | 99 |
| 2020-12-12 | 112 |
| 2020-12-13 | 123 |
| 2020-12-14 | 102 |
| 2020-12-15 | 100 |
| 2020-12-16 | 109 |
| 2020-12-17 | 105 |
| 2020-12-18 | 98 |
| 2020-12-19 | 106 |
| 2020-12-20 | 121 |
| 2020-12-21 | 98 |
| 2020-12-22 | 126 |
| 2020-12-23 | 103 |
| 2020-12-24 | 116 |
| 2020-12-25 | 94 |
| 2020-12-26 | 113 |
| 2020-12-27 | 95 |
| 2020-12-28 | 94 |
| 2020-12-29 | 135 |
| 2020-12-30 | 111 |
| 2020-12-31 | 107 |
The linear regression model predicts total unit sales of 3331 for December 2020, with a 95% confidence interval of 3231 to 3430. Thus, the model suggests that the target of 3900 units in December 2020 is unlikely to be met.
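A total like the one quoted above can be obtained from the daily predictions (a rough sketch: summing the daily standard errors is a simple, conservative approximation, since it ignores the covariance between daily predictions):

```r
# Total predicted December sales with an approximate 95% interval
total_fit <- sum(linear_reg_predict$fit)
total_se  <- sum(linear_reg_predict$se.fit)  # approximation; see note above

round(c(fit   = total_fit,
        lower = total_fit - 1.96 * total_se,
        upper = total_fit + 1.96 * total_se))
```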
market_sale_final$Type <- "Historical"
dec_ad_final$Type <- "Forecasted"
dec_ad_final$Sales_max <- with(linear_reg_predict, fit + 1.96*se.fit)
dec_ad_final$Sales_min <- with(linear_reg_predict, fit - 1.96*se.fit)
market_sale_final$Sales_max <-
market_sale_final$Sales_min <-
market_sale_final$Sales
g <-
market_sale_final %>%
bind_rows(dec_ad_final) %>%
mutate(Type = factor(Type)) %>%
ggplot(aes(x = Date, y = Sales, col = Type, group = Type)) +
geom_path() +
geom_ribbon(aes(ymin = Sales_min, ymax = Sales_max, fill = Type), alpha = 0.2) +
scale_x_date(date_breaks = "6 month", expand = c(0.01, 0.01),
date_labels = "%b %Y") +
theme_light()
plotly::ggplotly(g)

In conclusion, multiple linear regression appears to perform best here. The model explains 94.51% of the variation in total units sold and should therefore give reasonably accurate predictions.
An alternative approach is a time-series analysis using the forecast R package, which can better handle autocorrelation, seasonality, and other temporal patterns in the data. However, the forecast package requires the data to be converted to a Time-Series (ts) object, which can be complicated, especially if we need to include other predictors such as “Total Advertising Spend” in the model for inference and prediction.
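As a sketch of that alternative, the daily sales could be converted to a ts object with weekly frequency and fitted with auto.arima(), passing advertising spend via the xreg argument (the frequency and model choice here are illustrative assumptions, not tuned settings):

```r
library(forecast)

# Daily sales as a time series with weekly seasonality
sales_ts <- ts(market_sale_final$Sales, frequency = 7)

# ARIMA model with advertising spend as an external regressor
fit_ts <- auto.arima(sales_ts, xreg = market_sale_final$Ad_Spend)

# Forecast December 2020 using the planned advertising spend
fc <- forecast(fit_ts, xreg = dec_ad_final$Ad_Spend)
sum(fc$mean)  # total predicted December sales
```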